1  Introduction


Consider the multiple regression model

$$y_n = X_n \beta + e_n, \qquad (1.1)$$

where $X_n$ is an $n \times p$ matrix, $\beta$ is a $p$-vector of unknown regression parameters and $e_n$ is
a random error vector. Each component of $\beta$ may be zero or nonzero. Each subset $M$ of
$\{1, 2, \ldots, p\}$ is called a sub-model, so there are $2^p$ possible sub-models for the
multiple regression problem. A sub-model is called a true model if $\beta_i = 0$ for all $i \notin M$. The
problem is to find the smallest true model, defined as the true model none of whose proper
sub-models is a true model.

Many model selection rules have been proposed in the literature for choosing the smallest
true model of the multiple regression problem. Cross-validation is a popular method, which
selects the sub-model that gives the best average prediction error for the observations.
See, among others, Stone (1974, 1977a, b), Geisser (1975), Efron (1983, 1986), Picard and
Cook (1984) and Rao (1987). When the number $k$ of predictors is fixed, cross-validation is
equivalent to Akaike's AIC, which does not provide a consistent procedure. Shao (1993) showed
that $k/n \to 1$ as $n \to \infty$ is needed to guarantee that the selected model is asymptotically
correct. When $k$ is large, the amount of computation required for the cross-validation
approach is impractical. To reduce the computation for large $k$, several approaches were
proposed in Shao (1993), and their performance was examined by simulation studies.

Based on the prediction errors, the FPE$_\alpha$ criterion has been suggested. For references, see
Akaike (1970, 1974), Atkinson (1980), Shibata (1986), and others.

An alternative procedure of model selection is the so-called general information criterion
(GIC), dating back to Akaike's AIC (1970, 1973). Further work in this direction can be
found in Mallows (1973), Schwarz (1978), Hannan and Quinn (1979), Shibata (1984) and
Zhao et al. (1986).

Regarding the relation between FPE$_\alpha$ and GIC, GIC appears to be the more general of the
two. For example, the criterion proposed in Rao and Wu (1989) is an FPE$_\alpha$, but it can
also be viewed as a case of GIC. As for its performance, Rao and Wu (1989) showed that if
$\alpha$ is chosen such that $\alpha/n \to 0$ and $\alpha/\log\log n \to \infty$, then the
criterion selects the smallest true model with probability one under some mild conditions.
In this paper, the restriction on $e_n$ is relaxed to allow the components of $e_n$ to be
nonidentically distributed. Accordingly, some adjustments are made in the criterion. It
will be shown that the new procedure is also strongly consistent.

The  paper  is  organized  as  follows:  The  proposed  criteria  will  be  stated  and  investigated 
in  Section  2,  by  establishing  some  general  theorems  on  the  strong  consistency.  Section  3 
is  devoted  to  the  development  of  sample-dependent  penalty  functions.  Some  applications 
to  the  general  case  will  be  discussed  in  Section  4.  The  simulation  results  are  presented  in 
Section  5.  Discussions  and  comments  are  given  in  Section  6.  Some  technical  lemmas  are 
presented  in  the  Appendix. 

2  General  Model  Selection  Criteria 

Consider the regression model (1.1). Denote $X_n = (x_{1n} \cdots x_{pn}) = (x_{(1)} \cdots x_{(n)})'$.
Throughout this paper, $P_i$ stands for the orthogonal projection operator onto the space
spanned by $x_{1n}, \ldots, x_{in}$. The following assumptions are needed for establishing our main results.

ASSUMPTION 1. There are constants $a_1$ and $a_2$ such that

$$0 < a_1 n \le \lambda_p(X_n'X_n) \le \lambda_1(X_n'X_n) \le a_2 n, \qquad (2.1)$$

where $\lambda_i(X_n'X_n)$ is the $i$-th eigenvalue of $X_n'X_n$.
ASSUMPTION 2. There is a constant $\delta > 0$ such that for each $1 \le i \le p$,

$$\sum_{j=1}^{n} |x_{ij,n}|^3 = O\left[(x_{in}'x_{in})^{3/2} / \log^{1+\delta}(x_{in}'x_{in})\right], \qquad (2.2)$$

where $x_{ij,n}$ is the $j$th component of $x_{in} = (x_{i1,n}, \ldots, x_{in,n})'$.

ASSUMPTION 3. The components of $e_n = (e_1, \ldots, e_n)'$ are independently distributed
with zero mean and satisfy the moment conditions

$$0 < \nu^2 \le E(e_i^2), \qquad E(|e_i|^3) \le \tau^3 < \infty \qquad (2.3)$$

for all $1 \le i \le n$.


We first consider the $p$ consecutive sub-models $\{M_1, \ldots, M_p\}$, where $M_k$ denotes the
model $\beta = (\beta_1, \ldots, \beta_k \neq 0, 0, \ldots, 0)'$. Let $S_k$ be the residual sum of squares under the model
$M_k$. Define the following criterion functions:

(1) $G_n^{(1)}(k) = S_k + k C_n S_p/(n - p)$, $k = 1, \ldots, p$;

(2) $G_n^{(2)}(k) = S_k + k C_n$, $k = 1, \ldots, p$;

(3) $G_n^{(3)}(k) = n \log S_k + k C_n$, $k = 1, \ldots, p$;


where $C_n$ is a function of $n$ satisfying the conditions

$$\frac{C_n}{n} \to 0 \quad \text{and} \quad \frac{C_n}{\log\log n} \to \infty. \qquad (2.4)$$


We propose the following selection rules based on the criteria $G_n^{(i)}$, $i = 1, 2, 3$: the selected
model is $M_{\hat{k}_n}$, where $\hat{k}_n$ satisfies

$$G_n^{(i)}(\hat{k}_n) = \min_{1 \le k \le p} G_n^{(i)}(k).$$

In the sequel, we shall call the selection procedure so defined Criterion ($i$).
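As an illustration, the three criteria for the nested sub-models can be evaluated as in the following minimal sketch. It assumes NumPy; the function names `rss` and `select_nested` are hypothetical and not taken from the paper, and the penalty value `C_n` is supplied by the user.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def select_nested(y, X, C_n):
    """Return the orders (1-based) selected by Criteria (1), (2) and (3).

    The sub-model M_k uses the first k columns of X, as in Section 2.
    """
    n, p = X.shape
    S = np.array([rss(y, X[:, :k]) for k in range(1, p + 1)])  # S_1, ..., S_p
    k = np.arange(1, p + 1)
    sigma2 = S[-1] / (n - p)        # S_p / (n - p)
    G1 = S + k * C_n * sigma2       # Criterion (1)
    G2 = S + k * C_n                # Criterion (2)
    G3 = n * np.log(S) + k * C_n    # Criterion (3), requires S_k > 0
    return tuple(int(np.argmin(G)) + 1 for G in (G1, G2, G3))
```

A call such as `select_nested(y, X, np.log(len(y)))` returns a triple of selected orders, one for each criterion.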


We  first  establish  the  following  theorem  of  the  strong  consistency  of  the  above  criteria. 


THEOREM 2.1. Suppose that Assumptions 1-3 hold for $n = 1, 2, \ldots$ and $M_{k_0}$ is the
smallest true model. If $C_n$ satisfies (2.4), then with probability one, for all large $n$,
Criterion (1) chooses the smallest true model. The same is true for Criterion (2).

In  order  to  prove  this  theorem,  we  need  the  following  lemma. 

LEMMA 2.1. Suppose that Assumptions 1-3 hold for $n = 1, 2, \ldots$. Then

(L1) $a_2 n \ge x_{in}'x_{in} \ge a_1 n$, as $n \to \infty$, $1 \le i \le p$;

(L2) $a_2 n \ge x_{in}'(I - P_{i-1})x_{in} \ge a_1 n > 0$, $1 \le i \le p$;

(L3) $x_{in}'e_n = O((n \log\log n)^{1/2})$ a.s., $1 \le i \le p$;

(L4) $e_n'P_i e_n = O(\log\log n)$ a.s., $1 \le i \le p$;

(L5) $\sum_{i=1}^{n} e_i^2/n$ is bounded away from 0 and $\infty$ almost surely;

(L6) $S_p/(n - p)$ is bounded away from 0 and $\infty$ almost surely.


PROOF. Using (2.1), (L1) and (L2) follow from Lemma A.1. The assertions
(L3) and (L4) follow from Assumptions 2-3 and Lemmas A.2-A.3. Noting that

$$\frac{1}{n}\sum_{i=1}^{n} e_i^2 = \frac{1}{n}\sum_{i=1}^{n} \left(e_i^2 - E e_i^2\right) + \frac{1}{n}\sum_{i=1}^{n} E e_i^2,$$

by Assumption 3, (L5) is a consequence of Lemma A.4. Finally, one can derive (L6) from
(L4) and (L5).


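PROOF OF THEOREM 2.1. Consider the case $k < k_0$. By (L1)-(L4) of Lemma 2.1
and the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
G_n^{(1)}(k) - G_n^{(1)}(k_0) &= S_k - S_{k_0} + (k - k_0) C_n S_p/(n - p) \\
&\ge a_1 n \beta_{k_0}^2 + \beta_{k_0} O((n \log\log n)^{1/2}) - (k_0 - k) C_n S_p/(n - p) \quad \text{a.s.} \qquad (2.5)
\end{aligned}$$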
By the condition $n^{-1}C_n \to 0$ of (2.4) and using (L6) of Lemma 2.1, one shows that

$$G_n^{(1)}(k) - G_n^{(1)}(k_0) > 0 \quad \text{a.s.}$$

Hence

$$\liminf \hat{k}_n \ge k_0 \quad \text{a.s.} \qquad (2.6)$$


Next, consider the case $k > k_0$. By (L4) of Lemma 2.1, with probability one, for all large
$n$, we have

$$G_n^{(1)}(k) - G_n^{(1)}(k_0) = (k - k_0) C_n S_p/(n - p) + O(\log\log n). \qquad (2.7)$$

This, together with the condition $C_n/\log\log n \to \infty$ of (2.4) and (L6) of Lemma 2.1, implies
that

$$G_n^{(1)}(k_0) - G_n^{(1)}(k) < 0.$$

This proves

$$\limsup \hat{k}_n \le k_0 \quad \text{a.s.} \qquad (2.8)$$

Combining (2.6) and (2.8), we ultimately obtain

$$\hat{k}_n \to k_0 \quad \text{a.s.}$$

The second assertion of the theorem can be proved similarly. The proof of Theorem 2.1
is complete.

The  following  theorem  is  concerned  with  the  strong  consistency  of  the  third  criterion. 
Although  its  statement  is  similar  to  those  of  the  previous  criteria,  there  are  some  differences 
in  the  proof  and  thus  we  state  and  prove  it  separately. 

THEOREM 2.2. Suppose that Assumptions 1-3 hold for $n = 1, 2, \ldots$ and $M_{k_0}$ is the
smallest true model. If $C_n$ satisfies (2.4), then Criterion (3) is strongly consistent.


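PROOF. Note that

$$S_j = \begin{cases}
\beta'X_n'(I - P_j)X_n\beta + 2\beta'X_n'(I - P_j)e_n + e_n'(I - P_j)e_n, & \text{if } j < k_0, \\
e_n'(I - P_j)e_n, & \text{if } j \ge k_0.
\end{cases} \qquad (2.9)$$

By (L4)-(L5) of Lemma 2.1, we have, for $1 \le j \le p$,

$$\nu^2 + o(1) \le S_j/n \le a_2|\beta|^2 + \nu^2 + o(1) \quad \text{a.s.} \qquad (2.10)$$

and

$$\frac{S_j - S_{k_0}}{S_{k_0}} \begin{cases}
\ge \eta + o_{a.s.}(1), & \text{if } j < k_0, \\
= O_{a.s.}(n^{-1}\log\log n), & \text{if } j > k_0,
\end{cases} \qquad (2.11)$$

where $\eta = a_1\beta_{k_0}^2/(a_2|\beta|^2 + \nu^2)$ is a positive constant.

Let $k > k_0$. Then, by (2.10), (2.4) and (2.11), we conclude

$$\begin{aligned}
G_n^{(3)}(k) - G_n^{(3)}(k_0) &= n \log\frac{S_k}{S_{k_0}} + (k - k_0) C_n \\
&= n \log\left(1 + \frac{S_k - S_{k_0}}{S_{k_0}}\right) + (k - k_0) C_n \\
&= O(\log\log n) + (k - k_0) C_n > 0 \quad \text{a.s.},
\end{aligned}$$

which implies that

$$\limsup \hat{k}_n \le k_0 \quad \text{a.s.} \qquad (2.12)$$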


Next let $k < k_0$. Since $\log(1 + x)$ is an increasing function of $x$, by (2.11) and (2.4) we
have

$$\begin{aligned}
G_n^{(3)}(k) - G_n^{(3)}(k_0) &= n \log\frac{S_k}{S_{k_0}} - (k_0 - k) C_n \\
&\ge n \log(1 + \eta + o_{a.s.}(1)) - (k_0 - k) C_n > 0 \quad \text{a.s.},
\end{aligned}$$

which implies that

$$\liminf \hat{k}_n \ge k_0 \quad \text{a.s.} \qquad (2.13)$$

The results (2.12) and (2.13) establish the theorem.

3  Data-Oriented  Penalty  Criteria 

In the criteria proposed in Section 2, the selection of $C_n$ is essential. When $C_n = 2$,
Criterion (1) reduces to the well-known AIC, which has been proved to be inconsistent.
The choice $C_n = \log n$, known as the BIC, is covered by Theorem 2.1 and is therefore
strongly consistent. Hannan and Quinn (1979) argued that the minimal choice of $C_n$
to guarantee strong consistency is $c \log\log n$ for some positive constant $c$. Although this result
is not a special case of Theorem 2.1 or 2.2, results similar to those of Hannan and Quinn
can be obtained by using the upper bound in our proofs. However, this does not suggest an
"optimal choice" of $C_n$ in any particular case. In Bai, Krishnaiah and Zhao (1989), it is proved that
the higher the order of $C_n$, the better the performance of the criterion. However,
this is only an asymptotic result. The choice of a large $C_n$ usually leads to serious underestimation
of the order of the model. By the theorems in Section 2, $C_n$ needs only to
satisfy the conditions $C_n/n \to 0$ and $C_n/\log\log n \to \infty$ to guarantee strong consistency.
However, these conditions do not give any range for the choice of $C_n$ for a given $n$. In other
words, except for the AIC and BIC, the selection of the penalty is not clearly specified.
Since the AIC is inconsistent and the BIC does not give the best convergence rate
of the probability of wrong determination of the model, the problem of optimal selection of
the penalty function $C_n$ remains unsolved. Rao and Wu (1989) proposed a data-oriented
penalty for model selection in linear models. Later, Chen et al. (1992) used a data-oriented
penalty to select models for AR time series. In this section, we further investigate
model selection with a data-oriented penalty.
As an example, we consider Criterion (1). Similar results are true for the other
two criteria and the details are omitted. Let a sequence of experimental measurements
$\{(y_1, x_{(1)}), \ldots, (y_n, x_{(n)})\}$ be available. Define, for a given integer $q$ with $1 \le q \le p$,

$$X_n(q) = (x_{1n} \cdots x_{qn}), \qquad \beta(q) = (\beta_1, \ldots, \beta_q)'.$$

If the model $M_q$ is true, it can be written as

$$y_n = X_n(q)\beta(q) + e_n.$$


We  shall  use  the  following  steps  to  choose  the  penalty  Cn. 


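1. Compute any consistent estimate $\hat\beta_n = (\hat\beta_{1n}, \ldots, \hat\beta_{pn})'$ of $\beta$. For example, let $\hat\beta_n$ be
the least squares estimate of $\beta$ in the model $M_p$.

2. Compute $\hat\sigma_p^2 = S_p/(n - p)$. Let $\tilde\beta_n = (\tilde\beta_{1n}, \ldots, \tilde\beta_{pn})'$ be defined as follows:

$$\tilde\beta_{in} = \begin{cases}
\hat\beta_{in}, & \text{if } |\hat\beta_{in}| \ge \kappa, \\
\kappa\,\mathrm{sign}(\hat\beta_{in}), & \text{if } |\hat\beta_{in}| < \kappa,
\end{cases} \qquad \text{for } i = 1, \ldots, p,$$

where $\kappa$ is a constant.


3. Compute $\hat{e}_n = y_n - X_n\hat\beta_n$.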

4. Let

$$u_n(h) = X_n(h)\tilde\beta_n(h) + \hat{e}_n,$$

for $h = 1, \ldots, p$. Denote

$$D_n(q, h) = S_q(h) - S_h(h),$$

where $S_q(h) = (u_n(h))'(I - P_q)u_n(h)$. It can be shown that $S_p(h) = S_p$ if $\hat\beta_n$
is the least squares estimate. Define

$$A_{1h} = \min_{q < h} \cdots, \qquad A_{2h} = \max_{q > h} \cdots.$$

Let $A_h = (A_{1h} + A_{2h})/2$.


5. Define

$$C_n^{(R)} = \frac{\text{average of } \{A_h,\ h = 1, \ldots, p\}}{1 + [\,\cdots\,]},$$

where $[b]$ denotes the integer part of $b$.


Then, $C_n$ is set to be $C_n^{(R)}$.
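Steps 1 to 4 can be sketched as follows. This is a minimal illustration assuming NumPy and the least squares estimate in Step 1; the function names `projection` and `penalty_differences` are hypothetical. It stops at the differences $D_n(q, h)$ and does not attempt the aggregation into $A_{1h}$, $A_{2h}$ and $C_n^{(R)}$.

```python
import numpy as np

def projection(X):
    """Orthogonal projection matrix onto the column space of X (full column rank assumed)."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

def penalty_differences(y, X, kappa):
    """Return the p x p array D with D[q-1, h-1] = S_q(h) - S_h(h)."""
    n, p = X.shape
    # Step 1: least squares estimate in the full model M_p
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Step 2: coefficients smaller than kappa in magnitude are moved to +/- kappa
    beta_tilde = np.where(np.abs(beta_hat) >= kappa,
                          beta_hat, kappa * np.sign(beta_hat))
    # Step 3: residual vector from the full-model fit
    e_hat = y - X @ beta_hat
    # Projections P_1, ..., P_p onto the first q columns
    P = [projection(X[:, :q]) for q in range(1, p + 1)]
    D = np.zeros((p, p))
    for h in range(1, p + 1):
        # Step 4: pseudo-observations u_n(h) built from the first h columns
        u = X[:, :h] @ beta_tilde[:h] + e_hat
        S = [float(u @ (u - P[q - 1] @ u)) for q in range(1, p + 1)]  # S_q(h)
        for q in range(1, p + 1):
            D[q - 1, h - 1] = S[q - 1] - S[h - 1]
    return D
```

Because the projections are nested, $S_q(h)$ is nonincreasing in $q$, so the entries with $q < h$ are nonnegative and those with $q > h$ are nonpositive.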

REMARK. The constant $\kappa$ used in the definition is determined by the practical requirement
on the distinguishability of the regression coefficients from zero. Intuitively, a small choice
of $\kappa$ will overestimate the model and vice versa.

We establish the following theorem to show that the procedure is asymptotically consistent.


THEOREM 3.1. Under the assumptions of Theorem 2.1, with probability one, Criterion
(1) eventually selects the smallest true model if $C_n$ is chosen as $C_n^{(R)}$.


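PROOF. By Theorem 2.1, we need to show that

$$\frac{C_n^{(R)}}{n} \to 0 \quad \text{and} \quad \frac{C_n^{(R)}}{\log\log n} \to \infty. \qquad (3.1)$$

By definition, we have

$$\begin{aligned}
D_n(q, h) &= (u_n(h))'(P_h - P_q)u_n(h) \\
&= (X_n(h)\tilde\beta_n(h) + X_n(k_0)\beta(k_0) - X_n\hat\beta_n + e_n)'(P_h - P_q) \\
&\qquad (X_n(h)\tilde\beta_n(h) + X_n(k_0)\beta(k_0) - X_n\hat\beta_n + e_n). \qquad (3.2)
\end{aligned}$$

Note that $X_n(k_0)\beta(k_0) = X_n\beta$ and by Lemma 2.1,

$$\hat\beta_n = (\beta(k_0)', 0')' + (X_n'X_n)^{-1}X_n'e_n = (\beta(k_0)', 0')' + O_{a.s.}\left(\sqrt{n^{-1}\log\log n}\right),$$

which implies that

$$X_n(k_0)\beta(k_0) - X_n\hat\beta_n = O_{a.s.}\left(\sqrt{\log\log n}\right).$$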


Consider the following two cases for each fixed $h$.

Case 1. $q > h$.

In this case, $(P_h - P_q)X_n(h) = 0$. Then, (3.2) becomes

$$\begin{aligned}
D_n(q, h) &= -(X_n(k_0)\beta(k_0) - X_n\hat\beta_n + e_n)'(P_q - P_h)(X_n(k_0)\beta(k_0) - X_n\hat\beta_n + e_n) \\
&= -O_{a.s.}(\log\log n).
\end{aligned}$$

Note that $D_n(q, h)$ is a negative number of order $O_{a.s.}(\log\log n)$. Thus, $A_{2h}$ is a positive
number of order $O_{a.s.}(\log\log n)$.

Case 2. $q < h$.

Note that $\tilde\beta_n(h) = \bar\beta(h) + O_{a.s.}\left(\sqrt{n^{-1}\log\log n}\right)$, where $\bar\beta$ is the $p$-vector whose $i$th element
is $\mathrm{sign}(\beta_i)\max(|\beta_i|, \kappa)$. By Lemma A.1, $n^{-1}\bar\beta(h)'X_n(h)'(P_h - P_q)X_n(h)\bar\beta(h)$ is bounded away
from both zero and infinity. Therefore,

$$D_n(q, h) = \bar\beta(h)'X_n(h)'(P_h - P_q)X_n(h)\bar\beta(h)(1 + o(1)) \quad \text{a.s.}, \qquad (3.3)$$

which is positive and has the exact order $n$. Combining both cases, we conclude that
$C_n^{(R)}$ has the exact order $\sqrt{n}$. This shows that (3.1) is true and hence completes the proof
of Theorem 3.1.

For Criteria (2) and (3), similarly defining the data-oriented penalty $C_n^{(R)}$, we can
establish results similar to those stated for Criterion (1) in Theorem 3.1.

The small-sample behavior of the proposed procedures is studied by Monte Carlo simulation
in Section 5.
4  Extensions of the Model Selection Criteria


In Section 2, we discussed model selection from the $p$ consecutive sub-models $\{M_1, \ldots, M_p\}$
associated with the multiple regression model (1.1). As mentioned there, we actually
have $2^p$ sub-models since each component of $\beta$ may be zero or not. In this section, we
extend the model selection to all these possible sub-models. For any true $\beta$, by rearranging
the components of $\beta$ and the columns of the design matrix $X_n$, we can get an equivalent
regression model whose smallest true model is one of the sub-models $\{M_1, \ldots, M_p\}$. Then,
we can apply the criteria introduced in Section 2. Since the assumptions do not change under
the rearrangement, the estimated model is still consistent. One then selects the smallest $\hat{k}$
among the model selections for all rearrangements. However, this approach involves a huge amount of
computation if $p$ is large: there are $2^p$ residual sums of squares to be computed. Here,
we suggest a leave-one-out approach (see Rao and Wu (1989)) to select the smallest true model,
which needs only the computation of $p + 1$ residual sums of squares.

For each $1 \le i \le p$, denote

$$\beta_{-i} = (\beta_1, \ldots, \beta_{i-1}, \beta_{i+1}, \ldots, \beta_p)'$$

and

$$X_{n,-i} = (x_{1n} \cdots x_{i-1,n}\ x_{i+1,n} \cdots x_{pn}).$$

Consider the model

$$y_n = X_{n,-i}\beta_{-i} + e_n.$$

Denote the corresponding usual residual sum of squares by $S_{-i}$. Define, for $1 \le i \le p$,

$$G_n^{(1)}(-i) = S_{-i} - S_p - C_n S_p/(n - p), \qquad (4.1)$$

where $C_n$ may be chosen in accordance with the condition (2.4), or as the random penalty
$C_n^{(R)}$ defined in the last section.
Then, choose the model as

$$\beta_i = 0 \ \text{if}\ G_n^{(1)}(-i) \le 0 \quad \text{and} \quad \beta_i \neq 0 \ \text{if}\ G_n^{(1)}(-i) > 0, \qquad i = 1, \ldots, p. \qquad (4.2)$$

We  now  establish  the  following  theorem. 

THEOREM 4.1. Under the conditions of Theorem 2.1, the estimated model by the rule
(4.2) is strongly consistent for the smallest true model.

PROOF. Suppose that in the true model $\beta_i \neq 0$. By (2.5) with $k_0 = p$ and $k = p - 1$, (L6)
of Lemma 2.1 and (2.4), we have $G_n^{(1)}(-i) > 0$ almost surely. Therefore, with probability
one, $\beta_i$ is taken to be non-zero in the selected model. Conversely, suppose that in the true
model $\beta_i = 0$. By (2.7) with $k_0 = p - 1$ and $k = p$, (L4) and (L6) of Lemma 2.1 and (2.4),
we have $G_n^{(1)}(-i) < 0$ almost surely, which implies that with probability one, $\beta_i$ is excluded
from the selected model. This completes the proof of the theorem.
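The rule (4.2) needs only $p + 1$ least squares fits, as the following minimal sketch illustrates. It assumes NumPy; the names `rss` and `select_leave_one_out` are hypothetical, and the penalty `C_n` is passed in by the caller.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def select_leave_one_out(y, X, C_n):
    """Return the 1-based indices i with G_n^(1)(-i) > 0, i.e. the retained regressors."""
    n, p = X.shape
    S_p = rss(y, X)                      # full-model residual sum of squares
    penalty = C_n * S_p / (n - p)
    keep = []
    for i in range(p):
        S_minus_i = rss(y, np.delete(X, i, axis=1))  # fit without column i
        if S_minus_i - S_p - penalty > 0:            # G_n^(1)(-i) > 0
            keep.append(i + 1)
    return keep
```

The returned list plays the role of the estimated set of nonzero coefficients; regressors not listed are set to zero.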

Similar to (4.1), one may define for each $1 \le i \le p$,

$$G_n^{(2)}(-i) = S_{-i} - S_p - C_n$$

or

$$G_n^{(3)}(-i) = n(\log S_{-i} - \log S_p) - C_n,$$

respectively. Then choose the model by letting

$$\beta_i = 0 \ \text{if}\ G_n^{(j)}(-i) \le 0 \quad \text{and} \quad \beta_i \neq 0 \ \text{if}\ G_n^{(j)}(-i) > 0, \qquad i = 1, \ldots, p, \quad j = 2 \ \text{or}\ 3.$$

Under the conditions of Theorem 2.1, one can show that with probability one these
criteria will eventually select the smallest true model. The proofs are similar to those of
Theorems 2.2 and 4.1, and thus are omitted.

5  Monte  Carlo  Study 

In this section, we examine by computer simulation the small-sample performance of the
model selection rules proposed in this paper. The regression model is assumed to be

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + e_i, \qquad i = 1, \ldots, n,$$

where $x_{1i}, \ldots, x_{5i}$, $i = 1, \ldots, n$, are i.i.d. $N(0, 1)$ random variables. In the simulations, $\kappa$ is set
to 0.01. In Tables 4.1 and 4.2, $e_1, \ldots, e_n$ are chosen to be independently distributed as
$N(0, u^2)$, where $u$ is a discrete random variable uniformly distributed on $\{1, \ldots, 5\}$. In
Tables 4.3, 4.4 and 4.5, $e_1, \ldots, e_n$ are chosen to be independent and identically distributed
$N(0, 1)$ random variables. In the tables, RC(1) denotes C(1) with $C_n^{(R)}$ of
Section 3 as the choice of $C_n$, and the numbers shown in the tables are the counts of
correct selection of the smallest true model in 1,000 replications. In the simulations, IMSL
subroutines DRNNOF and RNUND were used to generate the random numbers.
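A simulation of this kind can be sketched as below. This is an illustration under stated assumptions, not the original program: it uses NumPy's random generator in place of the IMSL subroutines, applies the rule (4.2) with a fixed user-supplied `C_n`, and draws one scale $u_i$ per observation in the heteroscedastic case. The names `rss` and `simulate` are hypothetical, and the counts it produces are not expected to reproduce the tables.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def simulate(beta, n, C_n, reps=1000, heteroscedastic=False, seed=0):
    """Count replications in which rule (4.2) recovers exactly the nonzero set of beta."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    truth = beta != 0
    hits = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))          # i.i.d. N(0, 1) regressors
        # Assumption: one scale u_i per observation, uniform on {1, ..., 5}
        scale = rng.integers(1, 6, size=n) if heteroscedastic else 1.0
        y = X @ beta + scale * rng.standard_normal(n)
        S_p = rss(y, X)
        penalty = C_n * S_p / (n - p)
        chosen = np.array([rss(y, np.delete(X, i, axis=1)) - S_p - penalty > 0
                           for i in range(p)])
        hits += int(np.array_equal(chosen, truth))
    return hits
```

For example, `simulate([6, 3, 0, 0, 7], n=30, C_n=np.log(30))` returns the number of correct selections out of 1,000 replications for that design and penalty.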

From Table 4.1, it is seen that with the same $C_n$, the criterion C(1) is superior to the
others and that RC(1) is comparable with C(1). The criteria AIC, SW and HQ, based
on Akaike (1970), Schwarz (1978) and Hannan and Quinn (1979) respectively, do not perform
as well as C(1) and RC(1). Table 4.2 shows that for the general multiple regression model,
the performance of RC(1) is very good and superior to all the others. Comparing
Tables 4.1 and 4.2, one finds that the criterion C(1) with $C_n = 5(\log n)^3$ performs quite
differently for the two models, whereas the performance of RC(1) is very stable across models.
Comparing Table 4.3 with Table 4.4, it can be seen that in either case RC(1) performs
very well. From Tables 4.3 and 4.5, it can be seen that for different signal-to-noise
ratios, the performance of C(1) depends on the choice of $C_n$, whereas RC(1) automatically adapts
to the optimal choice of $C_n$.
Table 4.1  $C_n = 5(\log n)^3$ and $\beta = (6\ 3\ 7\ 0\ 0)'$

Sample size   C(1)    C(2)   C(3)   RC(1)   AIC   SW    HQ
15            993     973    876    923     683   714   592
20            998     975    917    951     705   743   630
25            1,000   975    924    969     752   786   683
30            1,000   975    918    982     726   769   667
35            1,000   981    935    992     747   792   678
40            1,000   978    935    995     769   804   698
45            1,000   985    940    1,000   767   819   713
50            1,000   976    926    997     741   779   682

Table 4.2  $\beta = (6\ 3\ 0\ 0\ 7)'$

Sample size            15    20    25    30    35    40    45    50
C(1)  $5(\log n)^3$    23    11    23    48    80    95    130   251
C(1)  $4(\log n)^2$    293   287   464   625   754   824   876   955
C(1)  $(\log n)^3$     322   182   191   212   221   186   181   273
RC(1)                  801   792   897   955   967   960   946   986


Table 4.3  $\beta = (6\ 3\ 0\ 0\ 7)'$

Sample size            15    20      25      30      35      40      45      50
C(1)  $5(\log n)^3$    946   995     1,000   1,000   1,000   1,000   1,000   1,000
C(1)  $\log n$         752   737     747     740     731     773     742     733
C(2)  $5(\log n)^3$    999   1,000   1,000   1,000   1,000   1,000   1,000   1,000
C(2)  $\log n$         786   773     764     763     770     782     753     734
RC(1)                  998   999     1,000   1,000   1,000   999     1,000   1,000

Table 4.4  $\beta = (6\ 0\ 3\ 7\ 0)'$

Sample size            15    20      25      30      35      40      45      50
C(1)  $5(\log n)^3$    557   983     995     1,000   1,000   1,000   1,000   1,000
C(1)  $\log n$         699   718     708     720     736     770     729     763
C(2)  $5(\log n)^3$    524   1,000   1,000   1,000   1,000   1,000   1,000   1,000
C(2)  $\log n$         743   748     739     747     754     778     747     767
RC(1)                  470   963     975     1,000   1,000   999     1,000   1,000

Table 4.5  $\beta = (1.2\ 1.5\ 0\ 0\ 1.3)'$

Sample size            15    20    25    30    35    40    45    50
C(1)  $5(\log n)^3$    34    24    86    184   310   388   515   596
C(1)  $\log n$         752   737   747   740   731   773   742   733
C(2)  $5(\log n)^3$    0     0     3     44    209   320   478   581
C(2)  $\log n$         786   773   764   763   770   782   753   734
RC(1)                  912   923   980   994   992   996   993   994
6  Discussions  and  Conclusions


To remedy the inconsistency of the AIC, various criteria have been proposed in the literature.
Cross-validation has been proved to be equivalent to the AIC. Most other criteria use a fixed
choice of the penalty function $C_n$ such that $c \log\log n \le C_n = o(n)$, for some constant $c > 0$,
to guarantee strong consistency. However, a fixed choice may be good in some situations and
bad in others. As shown in our simulations, the criterion with a data-oriented
penalty has some advantages.

7  Appendix.  Preliminary  Lemmas 

Denote the eigenvalues of a symmetric matrix $A$ of order $k$ by $\lambda_1(A) \ge \cdots \ge \lambda_k(A)$. The
following lemmas are used in the proofs of the main results.

LEMMA A.1. Let $b_1, \ldots, b_p$ be $n$-vectors and denote $W_i = B_i'B_i$, where

$$B_i = (b_1 \cdots b_i), \qquad i = 1, \ldots, p.$$

If there exist constants $\eta_1$ and $\eta_2$ such that

$$0 < \eta_1 \le \lambda_p(W_p) \le \lambda_1(W_p) \le \eta_2,$$

then

(1) $\eta_1 \le b_i'b_i \le \eta_2$, $1 \le i \le p$,

(2) $\eta_1 \le b_i'Q_{i-1}b_i \le \eta_2$, $1 \le i \le p$,

(3) $\eta_1 \le \lambda_{i-j}(B_i'(P_i - P_j)B_i) \le \lambda_1(B_i'(P_i - P_j)B_i) \le \eta_2$, $j < i$, (A.1)

where $P_i$ is the projection matrix onto the space spanned by $b_1, \ldots, b_i$ and $Q_i = I - P_i$.
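PROOF. For any vector $x$ such that $x'x = 1$, we have

$$\eta_1 \le \lambda_p(W_p) \le x'W_p x \le \lambda_1(W_p) \le \eta_2.$$

Then the result (1) follows by choosing $x' = (0, \ldots, 0, 1, 0, \ldots, 0)$, where the number 1 is in
the $i$-th position.

By the interlace theorem (see the Sturmian Separation Theorem in Rao (1973, page 64)),

$$\lambda_j(W_i) \ge \lambda_j(W_{i-1}) \ge \lambda_{j+1}(W_i), \qquad j = 1, \ldots, i - 1. \qquad (A.2)$$

Note that

$$b_i'Q_{i-1}b_i = \frac{|W_i|}{|W_{i-1}|} = \frac{\lambda_1(W_i) \cdots \lambda_i(W_i)}{\lambda_1(W_{i-1}) \cdots \lambda_{i-1}(W_{i-1})},$$

so that by (A.2)

$$\lambda_i(W_i) \le b_i'Q_{i-1}b_i \le \lambda_1(W_i).$$

The assertion (2) then follows, since, using (A.2) once again,

$$\lambda_p(W_p) \le \lambda_i(W_i) \quad \text{and} \quad \lambda_1(W_i) \le \lambda_1(W_p) \quad \text{for } i \le p.$$

Since $\lambda_k((I - P_j)B_iB_i') = \lambda_k(B_i'(I - P_j)B_i)$ and $\lambda_k(B_iB_i') = \lambda_k(B_i'B_i)$, for $k = 1, \ldots, i$,
and $B_i'(P_i - P_j)B_i = B_i'(I - P_j)B_i$ because $P_iB_i = B_i$, it follows by the interlace theorem that

$$\lambda_i(B_i'B_i) \le \lambda_{i-j}(B_i'(P_i - P_j)B_i) \le \lambda_1(B_i'(P_i - P_j)B_i) \le \lambda_1(B_i'B_i),$$

which, together with (A.2), implies the conclusion (3).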

LEMMA A.2. Let $X_n = (x_{1n} \cdots x_{kn})$, where the $x_{in}$'s are $n$-vectors. Assume that the $e_n$'s are
$n$-dimensional random vectors, $n = 1, 2, \ldots$, such that

$$x_{in}'e_n = O((n \log\log n)^{1/2}), \qquad 1 \le i \le k, \qquad (A.3)$$

and

$$0 < c\,n \le \lambda_k(X_n'X_n). \qquad (A.4)$$

Then

$$e_n'P_ne_n = O(\log\log n) \quad \text{a.s.},$$

where $P_n = X_n(X_n'X_n)^{-1}X_n'$.

PROOF. Let $\gamma_{in}$ be the $i$-th eigenvector of $X_n'X_n$ and $\Lambda_n = \mathrm{diag}(\lambda_1(X_n'X_n), \ldots, \lambda_k(X_n'X_n))$.
Then the $(i, j)$-th element of $(X_n'X_n)^{-1}$ is

$$\gamma_{in}'\Lambda_n^{-1}\gamma_{jn} = O(n^{-1}),$$

using the condition (A.4).

Now by (A.3) and (A.4), it follows that

$$e_n'P_ne_n = e_n'X_n(X_n'X_n)^{-1}X_n'e_n = O(\log\log n),$$

since each component of $e_n'X_n$ is $O((n \log\log n)^{1/2})$ and each element of $(X_n'X_n)^{-1}$ is $O(n^{-1})$.
The lemma is proved.

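LEMMA A.3. Let $\varepsilon_1, \varepsilon_2, \ldots$ be a sequence of independent variables with zero mean such
that $0 < \nu^2 \le E(\varepsilon_i^2) = \sigma_i^2$ and $E(|\varepsilon_i|^3) \le \tau^3 < \infty$ for $i = 1, 2, \ldots$. If $a_1, a_2, \ldots$ is a sequence
of constants such that

(I) $A_n^2 = \sum_{i=1}^{n} a_i^2 \to \infty$, as $n \to \infty$;

(II) $\sum_{i=1}^{n} |a_i|^3 = O(A_n^3(\log A_n^2)^{-(1+\delta)})$, for some $\delta > 0$,

then, almost surely,

$$\sum_{i=1}^{n} a_i\varepsilon_i = O\left((A_n^2 \log\log A_n^2)^{1/2}\right). \qquad (A.5)$$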
PROOF. Let $B_n^2 = \sum_{i=1}^{n} a_i^2\sigma_i^2$ and let $F_n$ and $\Phi$ denote the distributions of $B_n^{-1}\sum_{i=1}^{n} a_i\varepsilon_i$
and the standard normal random variable respectively. Since $0 < \nu^2 \le \sigma_i^2$ and $E(|\varepsilon_i|^3) \le \tau^3$
for $i = 1, 2, \ldots$, by Theorem 3 of Petrov (1975, page 111) and Assumption (II), we have,
for some constant $M > 0$,

$$\begin{aligned}
\sup_x |F_n(x) - \Phi(x)| &\le M B_n^{-3}\sum_{i=1}^{n} |a_i|^3 E|\varepsilon_i|^3 \\
&= O\left(A_n^{-3}\sum_{i=1}^{n} |a_i|^3\right) = O\left((\log A_n^2)^{-1-\delta}\right). \qquad (A.6)
\end{aligned}$$

Now from Assumptions (I) and (II), it follows that

$$\frac{B_{n+1}^2}{B_n^2} = 1 + \frac{a_{n+1}^2\sigma_{n+1}^2}{B_n^2} \to 1. \qquad (A.7)$$

By Assumption (I), (A.6) and (A.7), (A.5) follows from Theorem 3 of Petrov (1975, page
305).


LEMMA A.4. Suppose that $\varepsilon_1, \varepsilon_2, \ldots$ are independently distributed random variables with
zero means and bounded $(1 + \delta)$th moments for some $\delta > 0$. Then

$$\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \to 0 \quad \text{a.s.}$$

A proof of this lemma can be found in Chung (1974).